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David I. August, Daniel A. Connors, Scott A. Mahlke, John W. Sias, Kevin M. Crozier, Ben-Chung Cheng, Patrick 
Eaton, Qudus B. Olaniran, Wen-mei W. Hwu

April 1998 ACM SIGARCH Computer Architecture News, Proceedings of the 25th annual international
symposium on Computer architecture, volume 26 issue 3


Explicitly Parallel Instruction Computing (EPIC) architectures require the compiler to express program
instruction-level parallelism directly to the hardware. EPIC techniques which enable the compiler to represent
control speculation, data dependence speculation, and predication have individually been shown to be very
effective. However, these techniques have not been studied in combination with each other. This paper presents
the IMPACT EPIC Architecture to address the issues involved in design ...

Effect of node size on ...

Richard A. Hankins, Jignesh M. Patel 

June 2003 ACM SIGMETRICS Performance Evaluation Review, Proceedings of the 2003 ACM SIGMETRICS
international conference on Measurement and modeling of computer systems, volume 31 issue ...


In main-memory databases, the number of processor cache misses has a critical impact on the performance of the
system. Cache-conscious indices are designed to improve performance by reducing the number of processor cache
misses that are incurred during a search operation. Conventional wisdom suggests that the index's node size should
be equal to the cache line size in order to minimize the number of cache misses and improve performance. As we
show in this paper, this design choice ignores additi ...



Keywords: B+-tree, cache-conscious, index 



DataScalar architectures

Doug Burger, Stefanos Kaxiras, James R. Goodman 

May 1997 ACM SIGARCH Computer Architecture News, Proceedings of the 24th annual international
symposium on Computer architecture, volume 25 issue 2

DataScalar architectures improve memory system performance by running computation redundantly across multiple
processors, which are each tightly coupled with an associated memory. The program data set (and/or text) is
distributed across these memories. In this execution model, each processor broadcasts operands it loads from its
local memory to all other units. In this paper, we describe the benefits, costs, and problems associated with the
DataScalar model. We also present simulation results of ...
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High-bandwidth address translation for multiple-issue processors
Todd M. Austin, Gurindar S. Sohi 

May 1996 ACM SIGARCH Computer Architecture News, Proceedings of the 23rd annual international
symposium on Computer architecture, volume 24 issue 2

In an effort to push the envelope of system performance, microprocessor designs are continually exploiting higher
levels of instruction-level parallelism, resulting in increasing bandwidth demands on the address translation
mechanism. Most current microprocessor designs meet this demand with a multi-ported TLB. While this design
provides an excellent hit rate at each port, its access latency and area grow very quickly as the number of ports is
increased. As bandwidth demands continue to increase ...

Binary translation and architecture convergence issues for IBM system/390
Michael Gschwind, Kemal Ebcioglu, Erik Altman, Sumedh Sathaye 

May 2000 Proceedings of the 14th international conference on Supercomputing 


We describe the design issues in an implementation of the ESA/390 architecture based on binary translation to a
very long instruction word (VLIW) processor. During binary translation, complex ESA/390 instructions are
decomposed into instruction "primitives" which are then scheduled onto a wide-issue machine. The aim is to achieve
high instruction level parallelism due to the increased scheduling and optimization opportunities which can be
exploited by binary translation software ...

A dynamic scheduling logic for exploiting multiple functional units in single chip multithreaded architectures
Prasad N. Golla, Eric C. Lin 

February 1999 Proceedings of the 1999 ACM symposium on Applied computing 

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Instruction cache fetch policies for speculative execution
Dennis Lee, Jean-Loup Baer, Brad Calder, Dirk Grunwald 

May 1995 ACM SIGARCH Computer Architecture News, Proceedings of the 22nd annual international
symposium on Computer architecture, volume 23 issue 2


Current trends in processor design are pointing to deeper and wider pipelines and superscalar architectures. The
efficient use of these resources requires speculative execution, a technique whereby the processor continues
executing the predicted path of a branch before the branch condition is resolved. In this paper, we investigate the
implications of speculative execution on instruction cache performance. We explore policies for managing instruction
cache misses ranging from aggressive po ...

Compiler-based I/O prefetching for out-of-core applications
Angela Demke Brown, Todd C. Mowry, Orran Krieger 

May 2001 ACM Transactions on Computer Systems (TOCS), volume 19 issue 2 


Current operating systems offer poor performance when a numeric application's working set does not fit in main
memory. As a result, programmers who wish to solve "out-of-core" problems efficiently are typically faced with the
onerous task of rewriting an application to use explicit I/O operations (e.g., read/write). In this paper, we propose
and evaluate a fully automatic technique which liberates the programmer from this task, provides high performance,
and requires only minima ...

Keywords: compiler optimization, prefetching, virtual memory 
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Freddy Gabbay, Avi Mendelson 

August 1998 ACM Transactions on Computer Systems (TOCS), volume 16 issue 3


This article presents an experimental and analytical study of value prediction and its impact on speculative execution
in superscalar microprocessors. Value prediction is a new paradigm that suggests predicting outcome values of
operations (at run-time) and using these predicted values to trigger the execution of true-data-dependent
operations speculatively. As a result, stalls to memory locations can be reduced and the amount of instruction-level
parallelism can be extended beyond the limi ...

Keywords: speculative execution, stride value prediction, value prediction 



Efficient and flexible value sampling

M. Burrows, U. Erlingson, S.-T. A. Leung, M. T. Vandevoorde, C. A. Waldspurger, K. Walker, W. E. Weihl
November 2000 Proceedings of the ninth international conference on Architectural support for programming
languages and operating systems, volume 34, 28 issue 5, 5


This paper presents novel sampling-based techniques for collecting statistical profiles of register contents, data
values, and other information associated with instructions, such as memory latencies. Values of interest are sampled
in response to periodic interrupts. The resulting value profiles can be analyzed by programmers and optimizers to
improve the performance of production uniprocessor and multiprocessor systems. Our value sampling system
extends the DCPI continuous profiling infrastructu ...



Efficient and flexible value sampling

M. Burrows, U. Erlingson, S.-T. A. Leung, M. T. Vandevoorde, C. A. Waldspurger, K. Walker, W. E. Weihl
November 2000 ACM SIGPLAN Notices, volume 35 issue 11


This paper presents novel sampling-based techniques for collecting statistical profiles of register contents, data
values, and other information associated with instructions, such as memory latencies. Values of interest are sampled
in response to periodic interrupts. The resulting value profiles can be analyzed by programmers and optimizers to
improve the performance of production uniprocessor and multiprocessor systems. Our value sampling system
extends the DCPI continuous profiling infrastructu ...



A scalable cross-platform infrastructure for application performance tuning using hardware counters
S. Browne, J. Dongarra, N. Garner, K. London, P. Mucci 

November 2000 Proceedings of the 2000 ACM/IEEE conference on Supercomputing (CDROM) 


The purpose of the PAPI project is to specify a standard API for accessing hardware performance counters available
on most modern microprocessors. These counters exist as a small set of registers that count "events", which are
occurrences of specific signals and states related to the processor's function. Monitoring these events facilitates
correlation between the structure of source/object code and the efficiency of the mapping of that code to the
underlying architecture. This ...



Speculative execution exception recovery using write-back suppression

Roger A. Bringmann, Scott A. Mahlke, Richard E. Hank, John C. Gyllenhaal, Wen-mei W. Hwu

December 1993 Proceedings of the 26th annual international symposium on Microarchitecture 




Keywords: VLIW, exception detection, exception recovery, scheduling, speculative execution, superscalar 
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Huiyang Zhou, Thomas M. Conte 

June 2003 Proceedings of the 17th annual international conference on Supercomputing 


The ever-increasing computational power of contemporary microprocessors reduces the execution time spent on
arithmetic computations (i.e., the computations not involving slow memory operations such as cache misses)
significantly. Therefore, for memory intensive workloads, it becomes more important to overlap multiple cache
misses than to overlap slow memory operations with other computations. In this paper, we propose a novel
technique to parallelize sequential cache misses, thereby increasing m ...

Keywords: memory disambiguation, memory level parallelism, prefetching, recovery-free value prediction 



DAISY: dynamic compilation for 100% architectural compatibility
Kemal Ebcioglu, Erik R. Altman

May 1997 ACM SIGARCH Computer Architecture News, Proceedings of the 24th annual international
symposium on Computer architecture, volume 25 issue 2

Although VLIW architectures offer the advantages of simplicity of design and high issue rates, a major impediment to
their use is that they are not compatible with the existing software base. We describe new simple hardware features
for a VLIW machine we call DAISY (Dynamically Architected Instruction Set from Yorktown). DAISY is specifically
intended to emulate existing architectures, so that all existing softwa ...

Keywords: binary translation, dynamic compilation, instruction-level parallelism, object code compatible VLIW,
superscalar
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September 2001 Journal on Educational Resources in Computing (JERIC) 

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Avoiding initialization misses ...

Jarrod A. Lewis, Bryan Black, Mikko H. Lipasti 

May 2002 ACM SIGARCH Computer Architecture News, volume 30 issue 2


This paper investigates a class of main memory accesses (invalid memory traffic) that can be eliminated altogether.
Invalid memory traffic is real data traffic that transfers invalid data. By tracking the initialization of dynamic memory
allocations, it is possible to identify store instructions that miss the cache and would fetch uninitialized heap data.
The data transfers associated with these initialization misses can be avoided without losing correctness. The memory
system property cr ...

Keywords: invalid memory traffic, initializing stores, cache installation, allocation range cache 



Speeding up irregular applications ...
Zheng Zhang, Josep Torrellas

May 1995 ACM SIGARCH Computer Architecture News, Proceedings of the 22nd annual international
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While many parallel applications exhibit good spatial locality, other important codes in areas like graph
problem-solving or CAD do not. Often, these irregular codes contain small records accessed via pointers. Consequently,
while the former applications benefit from long cache lines, the latter prefer short lines. One good solution is to combine
short lines with prefetching. In this way, each application can exploit the amount of spatial locality that it has.
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However, prefetching, if provided, ... 
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Thread-level speculation is a technique that enables parallel execution of sequential applications on a multiprocessor.
This paper describes the complete implementation of the support for thread-level speculation on the Hydra chip
multiprocessor (CMP). The support consists of a number of software speculation control handlers and modifications
to the shared secondary cache memory system of the CMP. This support is evaluated using five representative
integer applications. Our results show that the s ...
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Compiler optimizations are often driven by specific assumptions about the underlying architecture and
implementation of the target machine. For example, when targeting shared-memory multiprocessors, parallel
programs are compiled to minimize sharing, in order to decrease high-cost, inter-processor communication. This
paper reexamines several compiler optimizations in the context of simultaneous multithreading (SMT), a processor
architecture that issues instructions from multiple threads to the f ...

Keywords: cache size, compiler optimizations, cyclic algorithm, fine-grained sharing, instructions, inter-processor
communication, inter-thread instruction-level parallelism, latency hiding, loop tiling, loop-iteration scheduling,
memory system resources, optimising compilers, parallel architecture, parallel programs, performance, processor
architecture, shared-memory multiprocessors, simultaneous multithreading, software speculative execution
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